Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NParallelTextCorpus #270

Merged
merged 28 commits into from
Nov 25, 2024
Merged

NParallelTextCorpus #270

merged 28 commits into from
Nov 25, 2024

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Nov 8, 2024

Here's the Machine PR needed to support the updates to Serval.


This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Nov 8, 2024

Codecov Report

Attention: Patch coverage is 86.17363% with 86 lines in your changes missing coverage. Please review.

Project coverage is 70.13%. Comparing base (9720b46) to head (54419cf).

Files with missing lines Patch % Lines
src/SIL.Machine/Corpora/NParallelTextCorpus.cs 91.11% 25 Missing and 11 partials ⚠️
src/SIL.Machine/Corpora/NParallelTextRow.cs 57.69% 6 Missing and 5 partials ⚠️
src/SIL.Machine/Corpora/MergedTextCorpus.cs 80.00% 5 Missing and 5 partials ⚠️
src/SIL.Machine/Corpora/NParallelTextCorpusBase.cs 40.00% 9 Missing ⚠️
src/SIL.Machine/Corpora/CorporaExtensions.cs 42.85% 8 Missing ⚠️
src/SIL.Machine/Corpora/TextCorpusEnumerator.cs 91.86% 6 Missing and 1 partial ⚠️
...rc/SIL.Machine/Corpora/CorpusAlignmentException.cs 0.00% 3 Missing ⚠️
src/SIL.Machine/Corpora/ParallelTextCorpus.cs 91.30% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #270      +/-   ##
==========================================
+ Coverage   70.01%   70.13%   +0.12%     
==========================================
  Files         380      385       +5     
  Lines       31816    31962     +146     
  Branches     4455     4490      +35     
==========================================
+ Hits        22275    22418     +143     
+ Misses       8508     8500       -8     
- Partials     1033     1044      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Nov 8, 2024

Fixes #266

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Nov 8, 2024

The only outstanding bullet point from the original description that I think I haven't covered is: "NParallelTextCorpora will take a parameter to determine how equivalent references are combined: choose first, combine all, or permutate all". Is this something we'll really want? Or is there a behavior we'll always prefer? Right now it's permutating.

@johnml1135
Copy link
Collaborator

-- commits line 65 at r1:
Can you clean up this history?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/CorporaExtensions.cs line 577 at r1 (raw file):

            public override IEnumerable<TextRow> GetRows(IEnumerable<string> textIds)
            {
                int indexOfInRangeRow = -1;

Can you explain in English what the indexOfInRangeRow is trying to accomplish?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/CorporaExtensions.cs line 586 at r1 (raw file):

                        .ToList();
                    IReadOnlyList<int> indices =
                        nonEmptyIndices.Count > 0 ? nonEmptyIndices : Enumerable.Range(0, nRow.N).ToList();

How does this logic work with AllRows? Are they blank rows being send down?

@Enkidu93
Copy link
Collaborator Author

src/SIL.Machine/Corpora/CorporaExtensions.cs line 586 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

How does this logic work with AllRows? Are they blank rows being send down?

Could you elaborate on the connection with allRows? Yes, there may be empty rows output if all the rows available are empty. But it would be an unusual circunstance.

@Enkidu93
Copy link
Collaborator Author

src/SIL.Machine/Corpora/CorporaExtensions.cs line 577 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Can you explain in English what the indexOfInRangeRow is trying to accomplish?

The point of the variable is to keep track of whether you previously picked a row from a particular source that was the beginning of a range. This way, you'll keep picking from that source until the range is finished.

@Enkidu93
Copy link
Collaborator Author

-- commits line 65 at r1:

Previously, johnml1135 (John Lambert) wrote…

Can you clean up this history?

Sure! Would it be simpler just to squash this when I'm merging/rebasing?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/NParallelTextCorpus.cs line 222 at r1 (raw file):

                    {
                        int compareAlignmentCorpus = -1;
                        if (AlignmentCorpus != null)

What exactly is the function of the alignment corpus? Why is it active here, but not above when not all corpora were in lock-step?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/NParallelTextCorpus.cs line 250 at r1 (raw file):

                                    && minRefIndexes.All(j => j == i || !AllRowsList[j])
                                )
                                .Any(b => b)

Many of these are cryptic logic. Let's have at least a sentence per set of logic. "doing it the right way" would likely involve creating a dozen or so sub-functions to do the intermediate processing, but that may lead to more confusion. I just want to have enough bread crumbs that when someone comes back in 3 years, they'll be able to understand what is going on here without having to spend 3 days.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/NParallelTextCorpus.cs line 250 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Many of these are cryptic logic. Let's have at least a sentence per set of logic. "doing it the right way" would likely involve creating a dozen or so sub-functions to do the intermediate processing, but that may lead to more confusion. I just want to have enough bread crumbs that when someone comes back in 3 years, they'll be able to understand what is going on here without having to spend 3 days.

Or, if you take the complicated logic in these if statements and make functions with descriptive names, like below with "rangeInfo.IsInRange && AllInRangeHaveSegments(currentIncompleteRows)". That is decodable - the above logic less so.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/NParallelTextCorpus.cs line 345 at r1 (raw file):

            object[] defaultRefs = new object[] { };
            if (rows.Any(r => r != null))

Why rows.Any if you already threw a Argument null exception above?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 9 files at r2, 1 of 3 files at r4, 1 of 1 files at r6, 1 of 1 files at r7, 6 of 6 files at r8, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit)

@ddaspit ddaspit requested a review from johnml1135 November 14, 2024 22:54
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 9 files at r2, 6 of 6 files at r8, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/CorporaExtensions.cs line 541 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Is this the way you wanted it? One question I have is in regards to the AllRows parameter: Is this the cleanest way to pass it or the best point at which to inject it?

I don't think that AllRows is useful in this context. Do we need it? If not, we could simply not pass AllRows at all.


tests/SIL.Machine.Tests/Corpora/NParallelTextCorpusTests.cs line 86 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

@ddaspit - can you bring in clarity - this may be the last issue...

@Enkidu93's description of the behavior sounds correct. This assert is correct. There should be only one row that is returned.


src/SIL.Machine/Corpora/MergedTextCorpus.cs line 16 at r8 (raw file):

        private readonly Random _random;

        public MergedTextCorpus(NParallelTextCorpus nParallelTextCorpus, MergeRule mergeRule, int seed)

I would have this accept an enumerable of text corpora instead of an NParallelTextCorpus.

@johnml1135
Copy link
Collaborator

Once Damien's concerns are resolved, I'm fine with the changes.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/CorporaExtensions.cs line 541 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think that AllRows is useful in this context. Do we need it? If not, we could simply not pass AllRows at all.

Let me see - "AllRows" is valuable when wanting to align to VRef.txt - so you get one line per vref entry (needed for AQUA as it relies on that specific format). Beyond that, I can't think of any need for "AllRows". And that should likely be it's own function, taking a single TextCorpus and then calling "AlignToVref" which would return a single TextCorpus with 1 entry per VRef. Therefore, it would not be needed when "merging" or "choosing random" but merely in adding another extension to format a single existing corpus. @ddaspit, does this sound right to you?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/CorporaExtensions.cs line 541 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Let me see - "AllRows" is valuable when wanting to align to VRef.txt - so you get one line per vref entry (needed for AQUA as it relies on that specific format). Beyond that, I can't think of any need for "AllRows". And that should likely be it's own function, taking a single TextCorpus and then calling "AlignToVref" which would return a single TextCorpus with 1 entry per VRef. Therefore, it would not be needed when "merging" or "choosing random" but merely in adding another extension to format a single existing corpus. @ddaspit, does this sound right to you?

Actually, should we remove "AllRows" entirely from NParallelCorpus and rather just have a separate function that takes a single ITextCorpus to "AlignToVref"?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/CorporaExtensions.cs line 541 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Actually, should we remove "AllRows" entirely from NParallelCorpus and rather just have a separate function that takes a single ITextCorpus to "AlignToVref"?

We could also just rip out "AllRows" right now and then add the new function later, as it is not needed now.

@johnml1135
Copy link
Collaborator

Actually, should there be an ability to filter out non-scripture? What is the best way to do that?

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/CorporaExtensions.cs line 541 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

We could also just rip out "AllRows" right now and then add the new function later, as it is not needed now.

We shouldn't remove AllRows altogether. We just don't need to pass it in this function. Also, there is already an "AlignToVref" function, it is called ExtractScriptureCorpus.

@Enkidu93
Copy link
Collaborator Author

src/SIL.Machine/Corpora/CorporaExtensions.cs line 541 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We shouldn't remove AllRows altogether. We just don't need to pass it in this function. Also, there is already an "AlignToVref" function, it is called ExtractScriptureCorpus.

I think that AllRows is generally useful not just for us but perhaps even for others using the Machine library. One place we need it on the Serval side is when we align the pretranslate source and the train target: Since many rows will not have a target, this is necessary. Also, we use it currently (see the other PR) when doing the AlignManys since each corpus has already been filtered by textId/scriptureRange, meaning that in many cases, they won't overlap.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the Serval PR, you can see an example (although it might be a bit outdated) of the intended use. At the moment, I'm filtering out non-Scripture where necessary on that side.

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/CorporaExtensions.cs line 541 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I think that AllRows is generally useful not just for us but perhaps even for others using the Machine library. One place we need it on the Serval side is when we align the pretranslate source and the train target: Since many rows will not have a target, this is necessary. Also, we use it currently (see the other PR) when doing the AlignManys since each corpus has already been filtered by textId/scriptureRange, meaning that in many cases, they won't overlap.

(Sorry, just seeing your new message, Damien!) To respond to your earlier message more directly, at the moment where we're using it in Serval (see the other PR), we're passing in AllRows=[true,...,true], so maybe that ought to be the default? I'm not sure. It's less awkward to pass AllRows in with the API I was originally imagining; I'm only thinking of this now because it seems odd to pass in AllRows (which has to do with alignment) to a function that has to do with selection.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/MergedTextCorpus.cs line 16 at r8 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would have this accept an enumerable of text corpora instead of an NParallelTextCorpus.

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r9, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/CorporaExtensions.cs line 541 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

(Sorry, just seeing your new message, Damien!) To respond to your earlier message more directly, at the moment where we're using it in Serval (see the other PR), we're passing in AllRows=[true,...,true], so maybe that ought to be the default? I'm not sure. It's less awkward to pass AllRows in with the API I was originally imagining; I'm only thinking of this now because it seems odd to pass in AllRows (which has to do with alignment) to a function that has to do with selection.

Yes, I think we only need AllRows=[true,...,true] for the MergedTextCorpus to work properly. We shouldn't need to pass in the allRows parameter.


src/SIL.Machine/Corpora/CorporaExtensions.cs line 544 at r9 (raw file):

            this IEnumerable<ITextCorpus> corpora,
            IEnumerable<bool> allRows,
            int seed

We should make seed an optional parameter.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've addressed everybody's comments. I'll wait for any more feedback you all have; let me know then I'll re-review and merge.

Reviewable status: 8 of 12 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/CorporaExtensions.cs line 541 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yes, I think we only need AllRows=[true,...,true] for the MergedTextCorpus to work properly. We shouldn't need to pass in the allRows parameter.

Done.


src/SIL.Machine/Corpora/CorporaExtensions.cs line 544 at r9 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We should make seed an optional parameter.

Done.


tests/SIL.Machine.Tests/Corpora/NParallelTextCorpusTests.cs line 86 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

@Enkidu93's description of the behavior sounds correct. This assert is correct. There should be only one row that is returned.

Done.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Actually, I think I'll add at least one test to cover a case John uncovered).

Reviewable status: 8 of 12 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r10, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/CorporaExtensions.cs line 544 at r9 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Done.

You forgot to make it optional in the ChooseRandom function.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I've added a test to cover a case that wasn't covered. And I uncovered that the code that had not previously run actually breaks the behavior for the situation in which it would run. @ddaspit, do you remember what particular case the corresponding bit of logic in the original ParallelTextCorpus logic was meant to cover? I kept it for consistency but it wasn't covered by the tests previously either.

Reviewable status: 10 of 13 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/CorporaExtensions.cs line 544 at r9 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You forgot to make it optional in the ChooseRandom function.

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't remember what that check was for. I'm okay with the change, since the new test is testing for the correct behavior and all other tests pass.

Reviewed 3 of 3 files at r11, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Excellent work.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @johnml1135 from a discussion.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@Enkidu93
Copy link
Collaborator Author

@johnml1135, it's saying I can't merge without your approval. I tried to dismiss you from the discussion because I think your question was answered, but it's still not letting me. If you could take a look when you have time, I'd appreciate it :)

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 4 files at r10, 3 of 3 files at r11, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@johnml1135 johnml1135 merged commit 9724d44 into master Nov 25, 2024
4 checks passed
@johnml1135 johnml1135 deleted the nparallel_corpus branch November 25, 2024 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants